Text representation

Programming a GPT model.

Published

14 Sep, 2023

Embeddings

Text embeddings are numerical representations of words, sentences or documents. They are used in many NLP tasks, such as sentiment analysis, machine translation, and question answering.

  • Embeddings should capture features of words or concepts, and relationships between these.
  • A sentence embedding is just like a word embedding, except it associates every sentence with a vector full of numbers, capturing similarities between sentences.

are a way of representing words/sentences as vectors of numbers. The idea is that words/sentences with similar meanings will have similar vectors. This is useful for many tasks in natural language processing, such as sentiment analysis, machine translation, and question answering.

You can read more about text embeddings in this 👉 post.

THe following is an example of sentence embeddings, showing the distance between sentences. The distance is calculated using the dot product (cosine similarity) of the embeddings.

Code
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
import numpy as np
import pandas as pd
from pyobsplot import Plot, d3, Math, js
Code
sentences = [
    "Good morning, how are you?",
    "I am doing well, how about you?",
    "Hi, how are you doing today?",
    "I'm feeling a bit under the weather today.",
    "I like apples.",
        "One of my daughters doesn't like kiwis.",
    "The other doesn't like bananas.",
    "The earth is the third planet from the sun.",
    "The moon is a natural satellite of the earth.",
    "Jupiter is the fifth planet from the Sun and the largest in the Solar System.",
    "The humpback whale is renowned for its enchanting songs, which are believed to serve various purposes, including communication, mating, and navigation during migration.",
    "Dolphins, highly intelligent marine mammals, communicate with each other using a complex system of clicks, whistles, and body language, enabling them to work together in hunting and navigation.",
    "The honeybee, through its pollination efforts, plays a vital role in agriculture, contributing to the growth of many of the fruits and vegetables humans rely on for sustenance.",
    "The Large Plane Trees, also known as Road Menders at Saint-Rémy, is an oil-on-canvas painting by Vincent van Gogh.",
    "Pablo Picasso's Guernica, an iconic mural-sized oil painting, stands as a poignant representation of the horrors of war.",
    "This powerful artwork was created in response to the bombing of the town of Guernica during the Spanish Civil War."
]
Code
embeddings = np.array([embedding.embed_query(sentence) for sentence in sentences])

dot_product_matrix = np.dot(embeddings, embeddings.T)

df = pd.DataFrame(dot_product_matrix, columns=range(1, len(embeddings)+1))

df['embedding_index'] = range(1, len(embeddings)+1)

df = df.melt(id_vars=['embedding_index'], var_name='embedding_index_2', value_name='similarity')
Sentences
1 Good morning, how are you?
2 I am doing well, how about you?
3 Hi, how are you doing today?
4 I'm feeling a bit under the weather today.
5 I like apples.
6 One of my daughters doesn't like kiwis.
7 The other doesn't like bananas.
8 The earth is the third planet from the sun.
9 The moon is a natural satellite of the earth.
10 Jupiter is the fifth planet from the Sun and the largest in the Solar System.
11 The humpback whale is renowned for its enchanting songs, which are believed to serve various purposes, including communication, mating, and navigation during migration.
12 Dolphins, highly intelligent marine mammals, communicate with each other using a complex system of clicks, whistles, and body language, enabling them to work together in hunting and navigation.
13 The honeybee, through its pollination efforts, plays a vital role in agriculture, contributing to the growth of many of the fruits and vegetables humans rely on for sustenance.
14 The Large Plane Trees, also known as Road Menders at Saint-Rémy, is an oil-on-canvas painting by Vincent van Gogh.
15 Pablo Picasso's Guernica, an iconic mural-sized oil painting, stands as a poignant representation of the horrors of war.
16 This powerful artwork was created in response to the bombing of the town of Guernica during the Spanish Civil War.
Code
Plot.plot(
    {
        "height": 640,
        "padding": 0.05,
        "grid": True,
        "x": {"axis": "top", "label": "Embedding Index"},
        "y": {"label": "Embedding Index"},
        "color": {"type": "linear", "scheme": "PiYG"},
        "marks": [
            Plot.cell(
                df,
                {"x": "embedding_index", "y": "embedding_index_2", "fill": "similarity", "tip": True},
            ),
            Plot.text(
                df,
                {
                    "x": "embedding_index",
                    "y": "embedding_index_2",
                    "text": js("d => d.similarity.toFixed(2)"),
                    "title": "title"
                },
            ),
        ],
    }
)

You can see that the sentences are grouped together by similarity. For example, the sentences about fruit are grouped together, and the sentences about planets are grouped together, in the sense that they are similar to each other.

Back to top

Reuse

Citation

BibTeX citation:
@online{ellis2023,
  author = {Ellis, Andrew},
  title = {Text Representation},
  pages = {undefined},
  date = {2023-09-14},
  url = {https://virtuelleakademie.github.io/promptly-literate/pages/text-representation.html},
  langid = {en}
}
For attribution, please cite this work as:
Ellis, Andrew. 2023. “Text Representation.” September 14, 2023. https://virtuelleakademie.github.io/promptly-literate/pages/text-representation.html.